I have chosen the red wine dataset from a specific wine region: the Portuguese “Vinho Verde”. Due to privacy and logistic issues, the dataset contains only physicochemical (input) and sensory (output) variables. The inputs come from objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Finally, there are no missing values in the dataset.
Input variables:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
12 - quality (score between 0 and 10)
What is the size of the dataset?
## [1] 1599 13
There are 13 variables rather than 12 because the first one (“X”) is a row ID for each observation.
What do the different variables look like?
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
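To experiment with this structure without the full file, the first ten rows can be rebuilt by hand from the str() output above. This is only a sketch: `wine_head` is a hypothetical name that does not appear in the original analysis, and only columns whose first ten values are fully printed are included.

```r
# First ten observations, copied from the str() output above.
# `wine_head` is an illustrative name, not an object from the original analysis.
wine_head <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine_head)               # rows and columns of the sample
str(wine_head)               # same layout as the full data frame, fewer columns
summary(wine_head$alcohol)   # five-number summary plus mean for one feature
```

The same three calls (dim, str, summary) applied to the full data frame give the outputs shown in this section.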
Some statistics on the features
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Let’s create a rating variable to make plotting easier:
wine$rating <- ordered(wine$quality, levels = 1:10)
Let’s see the number of wines in each quality rating
## [1] 1.125704
Wines with the highest quality account for only about 1.1% of the dataset.
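The percentage above can be reproduced with a few lines of base R. The sketch below uses the quality scores of the first ten wines from the str() output, so its sample share differs from the full dataset; `share_top` is a hypothetical helper name. The last lines redo the calculation for the full dataset from the counts reported in this document (18 top-rated wines out of 1599).

```r
# Quality scores of the first ten wines, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor with the same levels as in the analysis (1 to 10).
rating <- ordered(quality, levels = 1:10)

# Counts per level; levels that never occur are kept with a count of 0.
rating_counts <- table(rating)
print(rating_counts)

# Share (in percent) of observations that carry the highest observed score.
# `share_top` is an illustrative helper, not part of the original analysis.
share_top <- function(q) 100 * mean(q == max(q))
share_top(quality)   # share within this ten-row sample only

# Full dataset: 18 of the 1599 wines have the top observed score of 8.
top_share_full <- 100 * 18 / 1599
print(top_share_full)
```

Keeping all ten levels in the ordered factor means that plots and tables show the unused ratings explicitly, which makes the imbalance of the classes easier to see.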
Do these red wines have a high alcohol content?
A large number of wines have around 9.5% alcohol. The majority of wines fall in the range of 9 to 12%, which is low for wine. However, the origin of the wines is known and such a result is normal for this region.
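A histogram like the one described here boils down to counting observations per bin. The sketch below does this in base R for the alcohol values of the first ten wines (taken from the str() output); the bin width of 0.5 is an assumption chosen for illustration, not the binning of the original plot.

```r
# Alcohol values of the first ten wines, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Compute the histogram without drawing it; the 0.5-wide bins are an assumption.
h <- hist(alcohol, breaks = seq(9, 11, by = 0.5), plot = FALSE)

# One row per bin: lower edge, upper edge, number of wines.
bins <- data.frame(from  = head(h$breaks, -1),
                   to    = h$breaks[-1],
                   count = h$counts)
print(bins)
```

Calling `plot(h)` draws the histogram; on the full dataset the same call shows the peak around 9.5% described above.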
Let’s have a look at the pH of the wines
As we could expect, most values are in the range of 3 to 3.5. It is odd to have no wine around 3.7 or 2.9, but this is more likely a gap in the sample than an error in the data. The distribution looks approximately normal.
Let’s see the amount of salt in wine
As we could expect it is quite low.
Let’s check the density of wine, which is related to the residual sugar and alcohol
The distribution looks roughly normal and most values are below 1, as expected. A density above 1 suggests that the fermentation did not run to completion, which might produce a bad wine (we will check this assumption later).
Let’s see then the amount of sugar which is also an indicator for the sweetness of the wine
The wines in the dataset are not really sweet, as the majority are below the mean of about 2.5 g/L.
Let’s see the volatile acidity variable
The distribution is slightly positively skewed.
Let’s see the amount of sulphates in the wines
The amount of sulphates is mostly between 0.5 and 1 g/L. This is important information, as sulphates supply SO2 to the wine, which prevents oxidation and bacterial proliferation. With sufficient free sulfur dioxide, volatile acidity stays lower. We will look at this relationship in the next section.
Let’s see then the total SO2
We can see some outliers, but the majority of wines are below 100 mg/L. This is expected, as in Europe the legal limit for red wine (in general) is 150 mg/L.
Let’s see the free SO2, which is important as these molecules protect the wine
Some statistics about free SO2:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The distribution is right-skewed, with a mean of 15.87 above the median of 14.
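Skewness can also be quantified rather than read off a plot. The sketch below defines a simple moment-based skewness (the helper name `sample_skewness` is hypothetical) and applies it to the free SO2 values of the first ten wines from the str() output; it then repeats the quick mean-versus-median check using the summary statistics printed above.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# `sample_skewness` is an illustrative helper, not part of the original analysis.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Free SO2 of the first ten wines, copied from the str() output.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
sample_skewness(free_so2)   # a positive value indicates a longer right tail

# Quick check with the full-dataset summary printed above:
# a mean above the median is consistent with a right-skewed distribution.
mean_full   <- 15.87
median_full <- 14
mean_full > median_full
```

A positive skewness also hints that a log or square-root transformation of the axis could make the histogram easier to read.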
Let’s see the amount of citric acid, which adds freshness to the wine
The distribution is positively skewed. Citric acid is used for basic wines, but it makes the wine microbially unstable. Because of this defect, winemakers more often use tartaric acid (here the fixed acidity parameter) to acidify wines.
The dataset has 1599 observations of 14 variables once the rating variable I created is included (13 in the original file, counting the ID column X). They are numerical except for quality and X (integer values) and rating (ordered factor). The data is also tidy.
The main interest is to determine the variables which are responsible for a good wine. In the literature, the quality of a wine is based on the level of citric acid, alcohol, pH and residual sugar. I will therefore check these features more closely.
I will also check the density, which is an indication of the fermentation process, and the SO2, which protects the wine against oxidation and microbial growth.
Yes, I created an ordered factor to make plotting easier in some investigations.
No, the dataset was tidy and already wrangled. I noticed some skewed distributions and outliers, but nothing that will influence our investigation heavily.
Let’s get an overview by plotting the variables against each other
Volatile acidity (VA) is often associated with oxidation problems in a wine due to the fact that both result from overexposure to oxygen and/or a lack of sulfur dioxide management.
The low VA alongside free sulfur dioxide could mean that enough free SO2 is present to keep VA at a good level.
In winemaking, the citric-sugar co-metabolism can also increase the formation of volatile acid in wine which can affect the wine aroma negatively if present at excessive levels.
Let’s see the rating with the degree of alcohol
The rating against the sugar would give information about the sweetness of wines
The quantity of SO2 is important as it protects the wine.
Is there a discrepancy in the level of SO2 amongst wines?
There is no huge difference, especially between wines of quality 5 and 6.
##
## Pearson's product-moment correlation
##
## data: wine$free.sulfur.dioxide and wine$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
This is the strongest positive correlation found with the help of ggpairs. The relationship is interesting but understandable, as free SO2 is included in the total sulfur dioxide.
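The t statistic in the test output follows directly from the correlation and the sample size, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below recomputes it from the values printed above; `r_to_t` is a hypothetical helper name.

```r
# t statistic of a Pearson correlation test, from r and the sample size n.
# `r_to_t` is an illustrative helper, not part of the original analysis.
r_to_t <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

# Values taken from the test output above: r = 0.6676665, n = 1599 (df = 1597).
t_so2 <- r_to_t(0.6676665, 1599)
round(t_so2, 2)   # about 35.84, matching the t value reported above
```

With 1597 degrees of freedom even a modest correlation gives a very small p-value, so the size of r matters more here than the significance.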
We would normally expect more alcohol to go with less sugar. Here the sugar remains approximately constant at a low level.
As we can expect, more sugar increases the density.
As the density is also linked with alcohol, let’s check that relationship
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
As we can see, higher alcohol goes with lower density, which is expected.
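The 95 percent confidence interval in the output above can be reproduced with the Fisher z-transformation, which is the usual construction for a Pearson correlation. The sketch below is a minimal version under that assumption; `fisher_ci` is a hypothetical helper name.

```r
# Confidence interval for a Pearson correlation via the Fisher z-transformation.
# `fisher_ci` is an illustrative helper, not part of the original analysis.
fisher_ci <- function(r, n, level = 0.95) {
  z  <- atanh(r)                      # transform r to the z scale
  se <- 1 / sqrt(n - 3)               # approximate standard error of z
  q  <- qnorm(1 - (1 - level) / 2)    # normal quantile for the chosen level
  tanh(z + c(-1, 1) * q * se)         # back-transform both bounds
}

# Correlation between alcohol and density reported above, with n = 1599.
ci_alc_den <- fisher_ci(-0.4961798, 1599)
round(ci_alc_den, 4)   # compare with the interval printed in the test output
```

Because the whole interval lies well below zero, the negative direction of the relationship is not in doubt, even though its strength is only moderate.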
The US legal limit for volatile acidity used here is 1.2 g/L (US federal rules set 1.4 g/L for red table wine and 1.2 g/L for other grape wines).
As I focus on quality wine, I looked more closely at pH, alcohol, sugar and citric acid. I found that the wines have a consistently low amount of residual sugar, which means they are not sweet wines and do not require a high level of fixed acidity to balance the sugar. I also observed many outliers in the sugar level, depending on the rating. The highest-rated wines have more alcohol, a higher level of citric acid and a lower pH. They make up only about 1.1% of the dataset, which means either that the wines are not good in general or that some data are missing. The latter is more likely.
All wines (except some outliers) are below the legal limit for volatile acidity. VA is kept under control with a low level of free sulfur dioxide, which is good; otherwise the wine might begin to smell and be considered bad. Finally, the relationship between free SO2 and total SO2 is positive and strong in terms of correlation. We also saw that higher alcohol goes with lower density.
The strongest positive correlation was between free sulfur dioxide and total sulfur dioxide (0.67), matched by citric acid and fixed acidity (0.67). The strongest negative correlation was between pH and fixed acidity (-0.68).
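A ranking like this one can be obtained by sorting all pairwise correlations by absolute value. The sketch below shows one way to do it in base R on the first ten wines from the str() output; with so few rows the coefficients will not match the full-dataset values quoted above, and the object names (`wine_head`, `ranked`) are hypothetical.

```r
# Four columns of the first ten wines, copied from the str() output.
# `wine_head` and `ranked` are illustrative names, not from the original analysis.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix; blank out the diagonal and the duplicated upper half.
cors <- cor(wine_head)
cors[upper.tri(cors, diag = TRUE)] <- NA

# Reshape to one row per variable pair and sort by absolute correlation.
ranked <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
ranked <- ranked[!is.na(ranked$Freq), ]
names(ranked) <- c("var1", "var2", "r")
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked, row.names = FALSE)
```

Sorting on the absolute value keeps strong negative relationships, such as pH against fixed acidity, at the top of the list next to the strong positive ones.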
Let’s see the different variables of interest against quality
First, let’s see the sugar versus the degree of alcohol
Another way to see the same data
As we can see, the better the wine the lower the sugar level.
Now let’s see whether the amount of SO2 and the alcohol level are related
We have limited the axes, as the majority of the data lies within these limits, and kept only wines of quality 6 and above.
Let’s see whether anything appears when we make different plots combining alcohol, sugar and fixed acidity
Even if these 3 parameters are linked to the quality of the wine, these 3 plots are not helpful.
Let’s see with the volatile acidity
As expected, lower quality wines have higher volatile acidity.
Finally, let’s see if the pH is related to SO2
Let’s create a model to estimate the level of alcohol
library(memisc)  # provides mtable() for side-by-side model tables

# Start from a single predictor and add one variable at a time
m1 <- lm(alcohol ~ fixed.acidity, data = wine)
m2 <- update(m1, ~ . + pH)
m3 <- update(m2, ~ . + residual.sugar)
m4 <- update(m3, ~ . + citric.acid)
m5 <- update(m4, ~ . + density)
m6 <- update(m5, ~ . + chlorides)
m7 <- update(m6, ~ . + sulphates)
mtable(m1, m2, m3, m4, m5, m6, m7)
##
## Calls:
## m1: lm(formula = alcohol ~ fixed.acidity, data = wine)
## m2: lm(formula = alcohol ~ fixed.acidity + pH, data = wine)
## m3: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar, data = wine)
## m4: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar +
## citric.acid, data = wine)
## m5: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar +
## citric.acid + density, data = wine)
## m6: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar +
## citric.acid + density + chlorides, data = wine)
## m7: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar +
## citric.acid + density + chlorides + sulphates, data = wine)
##
## ====================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## --------------------------------------------------------------------------------------------------------------------
## (Intercept) 10.737*** 2.667** 2.579** 1.909* 607.523*** 611.109*** 614.217***
## (0.130) (0.887) (0.887) (0.863) (12.699) (13.251) (12.759)
## fixed.acidity -0.038* 0.090*** 0.087*** -0.024 0.560*** 0.567*** 0.571***
## (0.015) (0.020) (0.020) (0.023) (0.019) (0.020) (0.019)
## pH 2.115*** 2.120*** 2.469*** 3.934*** 3.981*** 3.954***
## (0.230) (0.230) (0.226) (0.148) (0.156) (0.150)
## residual.sugar 0.039* 0.023 0.267*** 0.268*** 0.276***
## (0.019) (0.018) (0.013) (0.013) (0.012)
## citric.acid 1.785*** 0.833*** 0.810*** 0.529***
## (0.177) (0.115) (0.118) (0.116)
## density -617.700*** -621.535*** -625.188***
## (12.940) (13.559) (13.056)
## chlorides 0.360 -0.968*
## (0.380) (0.384)
## sulphates 1.154***
## (0.102)
## --------------------------------------------------------------------------------------------------------------------
## R-squared 0.004 0.054 0.057 0.113 0.635 0.635 0.662
## adj. R-squared 0.003 0.053 0.055 0.111 0.634 0.634 0.661
## sigma 1.064 1.037 1.036 1.005 0.645 0.645 0.621
## F 6.097 45.476 31.892 50.849 554.550 462.245 445.701
## p 0.014 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -2367.035 -2325.770 -2323.506 -2274.067 -1564.049 -1563.598 -1502.210
## Deviance 1807.863 1716.921 1712.066 1609.403 662.182 661.809 612.895
## AIC 4740.070 4659.541 4657.013 4560.134 3142.098 3143.196 3022.420
## BIC 4756.201 4681.049 4683.898 4592.397 3179.738 3186.213 3070.814
## N 1599 1599 1599 1599 1599 1599 1599
## ====================================================================================================================
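The fit statistics at the bottom of the table are linked by simple formulas, which makes them easy to sanity-check. The sketch below recomputes the adjusted R-squared, AIC and BIC of m7 from the R-squared and log-likelihood printed in the table; the helper names are hypothetical, and small differences are expected because the table values are rounded.

```r
n <- 1599   # number of observations, from the N row of the table

# Illustrative helpers, not part of the original analysis.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
aic    <- function(loglik, k) -2 * loglik + 2 * k
bic    <- function(loglik, k, n) -2 * loglik + log(n) * k

# m7 has p = 7 predictors; the likelihood has k = 9 parameters
# (7 slopes + intercept + residual variance).
adj_r2_m7 <- adj_r2(0.662, n, 7)
aic_m7    <- aic(-1502.210, 9)
bic_m7    <- bic(-1502.210, 9, n)

round(c(adj_r2 = adj_r2_m7, AIC = aic_m7, BIC = bic_m7), 3)
```

Both AIC and BIC drop sharply between m4 and m5, when density enters the model, and only slightly afterwards, which mirrors the jump in R-squared.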
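For the following input, which is the first wine in our dataset, the observed alcohol level is 9.4:
thisWine <- data.frame(pH = 3.51, fixed.acidity = 7.4, chlorides = 0.076,
                       sulphates = 0.56, residual.sugar = 1.9,
                       citric.acid = 0, density = 0.9978)
# m7 models alcohol directly (not log(alcohol)), so the prediction is not exponentiated
modelEstimate <- predict(m7, newdata = thisWine,
                         interval = "prediction", level = .95)
modelEstimate
## fit lwr upr
## 1 9.60 8.38 10.82
The point estimate of about 9.6 is close to the observed 9.4, but the 95% prediction interval is about 2.4 points of alcohol wide.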
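As a sanity check, the point prediction for this wine can be reproduced by hand from the m7 coefficients in the table above: it is the intercept plus each coefficient multiplied by the corresponding input. The sketch below uses the rounded coefficients from the table, so the result is approximate; the object names are hypothetical.

```r
# Rounded m7 coefficients, copied from the model table above.
# `coefs`, `this_wine` and `manual_fit` are illustrative names.
coefs <- c(intercept = 614.217, fixed.acidity = 0.571, pH = 3.954,
           residual.sugar = 0.276, citric.acid = 0.529,
           density = -625.188, chlorides = -0.968, sulphates = 1.154)

# Inputs of the wine used in the prediction; the intercept gets a constant 1.
this_wine <- c(intercept = 1, fixed.acidity = 7.4, pH = 3.51,
               residual.sugar = 1.9, citric.acid = 0,
               density = 0.9978, chlorides = 0.076, sulphates = 0.56)

# Linear predictor: sum of coefficient * input, matched by name.
manual_fit <- sum(coefs * this_wine[names(coefs)])
round(manual_fit, 2)   # about 9.6
```

The result, about 9.6, is already expressed in % alcohol by volume because m7 models alcohol directly; a back-transformation such as exp() would only be appropriate for a model fitted on log(alcohol).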
Great wines are in balance with their 4 fundamental traits (acidity, tannin, alcohol and sweetness).
The number of wines of great quality in our dataset is 18, and they have an alcohol level of around 12%. We can see that good wines have a higher level of SO2. The level of volatile acidity is low when the alcohol level is higher. The level of sugar is nearly the same for each category of wine, but I observed a slightly higher one for the top range. Maybe this adds some complexity to the wine, which was therefore more appealing to the tasters. Nevertheless, the result is expected when we know which kind of wine has been measured, as it is typical of this region of Portugal.
Yes, even if 3 parameters (residual sugar, alcohol and fixed acidity) are linked to the quality of the wine, it does not mean that a plot with these features together would be helpful. By extension, it is not surprising that the model we calculated was not efficient.
I have tried to build a model to predict the level of alcohol in a wine based on the physical variables. However, the R-squared is at best 0.66 with 7 variables, which is low, so the model is not relevant for the purpose. Apart from density, which lifted the R-squared from 0.11 to 0.63, adding more variables did not improve the fit significantly, so I didn’t search further.
This graph comes from the univariate section and it shows that a lot of wines in the dataset have a low alcohol level. This is normal for “Vinho Verde” wines, and it was the start of the search to understand which variables are important for a good wine.
This graph comes from the bivariate section. It shows the mean alcohol level for each category of wine compared to the blue line, which is the mean alcohol level for all wines. We can see that better wines have higher alcohol.
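The comparison behind this graph is a group mean set against an overall mean. The sketch below computes both in base R for the first ten wines from the str() output; with so few rows the numbers differ from the full dataset (where the overall mean is 10.42), and the object names are hypothetical.

```r
# Alcohol and quality of the first ten wines, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Mean alcohol per quality score (the points of the graph).
mean_by_quality <- tapply(alcohol, quality, mean)

# Mean alcohol over all wines (the horizontal reference line).
overall_mean <- mean(alcohol)

# Positive values mark categories that sit above the overall mean.
print(round(mean_by_quality - overall_mean, 3))
```

On the full dataset the same calculation gives the pattern described above, with the higher ratings sitting above the reference line.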
This graph comes from the last section. It shows that all wines, independently of their rating and alcohol level, have low sugar. These wines are therefore not really sweet.
The dataset has 1599 observations of 12 variables (11 inputs and the quality score). The dataset is tidy but concentrated on medium quality wines, i.e. mainly ratings of 5 and 6. It is therefore difficult to have sufficient data to understand whether a specific variable adds something to the quality of the wines tested or not. Moreover, the rating is subjective, and even if the physical components of the wine are in the right range, their combination might not be ideal. Thus vinification is still an art rather than a science.
Other factors could also have led to different results. It would have been nice to have the wines’ prices, when the grapes were harvested (a later harvest would give more sugar), the kind of soil where they grew and which grapes are in the wines, amongst others.
The model we calculated was definitely not accurate, even though the alcohol level should have been possible to predict; I guess some additional parameters of the fermentation process have to be taken into account.
References:
http://wineserver.ucdavis.edu/industry/enology/methods_and_techniques/techniques/ph_analysis.html
https://winefolly.com/review/understanding-acidity-in-wine/
https://vinepair.com/wine-blog/7-things-you-need-to-know-about-vinho-verde/
http://winemakersacademy.com/potassium-metabisulfite-additions/